Clustering and classification

# access the packages
library(MASS); library(corrplot); library(tidyr); library(dplyr); library(ggplot2)
## corrplot 0.84 loaded
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# load the data
data("Boston")

# explore the dataset
dim(Boston)
## [1] 506  14
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Data Set Information

The “Boston {MASS}” dataset consists of housing values in suburbs of Boston. The Boston data frame has 506 rows and 14 columns.

This data frame contains the following variables:

crim: per capita crime rate by town.

zn: proportion of residential land zoned for lots over 25,000 sq.ft.

indus: proportion of non-retail business acres per town.

chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

nox: nitrogen oxides concentration (parts per 10 million).

rm: average number of rooms per dwelling.

age: proportion of owner-occupied units built prior to 1940.

dis: weighted mean of distances to five Boston employment centres.

rad: index of accessibility to radial highways.

tax: full-value property-tax rate per $10,000.

ptratio: pupil-teacher ratio by town.

black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

lstat: lower status of the population (percent).

medv: median value of owner-occupied homes in $1000s.

Overview of the data

# Change the shape of the data from wide-format to long-format
require(reshape2)
## Loading required package: reshape2
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
melt.boston <- melt(Boston)
## No id variables; using all as measure variables
head(melt.boston)
##   variable   value
## 1     crim 0.00632
## 2     crim 0.02731
## 3     crim 0.02729
## 4     crim 0.03237
## 5     crim 0.06905
## 6     crim 0.02985
# draw a density plot of each variable
ggplot(data = melt.boston, aes(x = value)) + stat_density() + facet_wrap(~variable, scales = "free")

# plot matrix of the Boston dataset variables
pairs(Boston)

# calculate the correlation matrix of the Boston dataset and round it
cor_matrix <- cor(Boston)

# print the correlation matrix
cor_matrix %>% round(digits = 2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# visualize the correlation matrix of the dataset
corrplot(cor_matrix, method="number", type='upper', diag = FALSE)

Several of the variables are highly skewed. In particular, crim, zn, chas, dis, and black are highly skewed, and some of the others appear moderately skewed. The skewed distributions suggest that transforming some variables could improve their performance in the models. We can also observe several highly correlated variable pairs in the correlation matrix; we have to be careful with highly correlated variables so that their joint influence does not dominate the models. The next steps are to standardize the dataset and print summaries of the scaled data, create a categorical variable of the crime rate in the Boston dataset using the quantiles as the break points, drop the old crime rate variable from the dataset, and split the data into training and testing sets (80% of the data belongs to the train set).
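As a minimal sketch of such a transformation (not part of the analysis below), a log transform can reduce the skewness of a strictly positive variable such as crim:

# sketch: compare the distribution of crim before and after a log transform
par(mfrow = c(1, 2))
hist(Boston$crim, main = "crim", xlab = "crim")
hist(log(Boston$crim), main = "log(crim)", xlab = "log(crim)")
par(mfrow = c(1, 1))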

Standardizing the dataset and splitting it into training and testing sets

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
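The scale() function subtracts each column's mean and divides by its standard deviation; a minimal sketch to verify this by hand for one column:

# check the scaling of the crim column manually
all.equal(as.vector(boston_scaled[, "crim"]),
          (Boston$crim - mean(Boston$crim)) / sd(Boston$crim))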
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

# summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime', using the quantiles as the break points
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
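Because the break points are the quantiles, each class should contain roughly a quarter of the observations; a quick sanity check:

# class counts; quantile break points give roughly equal-sized classes
table(crime)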

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)
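# note: sample() is not seeded here, so the split differs between runs;
# calling set.seed() beforehand would make it reproducible. The
# non-integer size (506 * 0.8 = 404.8) is truncated, giving the 404
# training rows seen below.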

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

Fit the linear discriminant analysis (LDA) on the train set

Now that the training and test sets have been created, we are going to fit a linear discriminant analysis on the train set. Notice that in this case we have four classes. The LDA algorithm starts by finding directions that maximize the separation between classes, then uses these directions to predict the class of individuals. These directions, called linear discriminants, are linear combinations of the predictor variables.

LDA assumes that predictors are normally distributed (Gaussian distribution) and that the different classes have class-specific means and equal variance/covariance.

LDA determines group means and computes, for each individual, the probability of belonging to each group. The individual is then assigned to the group with the highest probability score.
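A minimal sketch of this assignment rule, reusing the test set created above and the lda.fit object fitted below: predict() returns the per-class posterior probabilities, and the predicted class is the one with the highest posterior.

# posterior probabilities: one row per observation, one column per class
post <- predict(lda.fit, newdata = test)$posterior
# the predicted class is the column with the highest posterior probability
head(colnames(post)[max.col(post)])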

The lda() output contains the following elements:

Prior probabilities of groups: the proportion of training observations in each group.

Group means: the mean of each variable in each group.

Coefficients of linear discriminants: the linear combinations of predictor variables used to form the LDA decision rule.

source: http://www.sthda.com/english/articles/36-classification-methods-essentials/146-discriminant-analysis-essentials-in-r/#linear-discriminant-analysis---lda

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2599010 0.2524752 0.2400990 0.2475248 
## 
## Group means:
##                  zn      indus         chas        nox         rm        age
## low       0.9279095 -0.8746763 -0.122344297 -0.8696347  0.4031917 -0.8748606
## med_low  -0.1095431 -0.2764984 -0.079333958 -0.5692274 -0.1901812 -0.3339528
## med_high -0.3793843  0.1361663  0.255323541  0.3691752  0.1271016  0.4078957
## high     -0.4872402  1.0171519  0.003267949  1.0749693 -0.3366619  0.8258984
##                 dis        rad        tax     ptratio       black      lstat
## low       0.8671167 -0.6767071 -0.7197454 -0.43916657  0.38239325 -0.7722497
## med_low   0.3745215 -0.5461293 -0.4782323 -0.05825635  0.35586788 -0.1220811
## med_high -0.3729400 -0.4396053 -0.3562496 -0.27946113  0.05446032  0.0418992
## high     -0.8520910  1.6377820  1.5138081  0.78037363 -0.80374216  0.8960484
##                 medv
## low       0.49700462
## med_low  -0.04130269
## med_high  0.17253054
## high     -0.71618264
## 
## Coefficients of linear discriminants:
##                 LD1         LD2          LD3
## zn       0.05009772  0.64207880 -0.950731976
## indus    0.10356583 -0.10204444  0.337704770
## chas    -0.13985022 -0.06768071  0.004420603
## nox      0.33079871 -0.80960156 -1.278168419
## rm      -0.17150377 -0.18122296 -0.259021417
## age      0.15039918 -0.33012445 -0.086823868
## dis     -0.04219737 -0.21247088  0.218490757
## rad      3.70570914  1.01252252 -0.097687093
## tax      0.03515543 -0.02130031  0.563338138
## ptratio  0.09048200 -0.08257425 -0.303352559
## black   -0.09004387  0.06428368  0.164388920
## lstat    0.20576348 -0.37925185  0.341327562
## medv     0.20924378 -0.48027914 -0.152088541
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9582 0.0308 0.0110
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

The training data was divided into quantiles, with crime as the target variable. In the plot we see the four classes: three of them overlap and one is clearly separated from the others. The proportion of trace above shows that LD1 alone accounts for about 96% of the between-group variance. Looking at the arrows tells us which variables affect the classification the most (rad, zn, nox), but because there are so many variables it is hard to distinguish the effects of the others.

Predict the classes with the LDA model on the test data

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       16       5        1    0
##   med_low    6      13        5    0
##   med_high   0       6       20    3
##   high       0       0        0   27
# calculate the accuracy percentage of the model
correct_predicts <- 100 * mean(lda.pred$class == correct_classes)
correct_predicts <- round(correct_predicts, digits = 0)

# print the percentage of correct predictions
print(correct_predicts)
## [1] 75

We split our data earlier so that we have a test set and the correct class labels. The model's performance on the test data is acceptable but not perfect (prediction accuracy is 75%). It predicts the high crime rate class perfectly, but the lower rate classes less accurately.
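To quantify this per class, one can divide the diagonal of the cross tabulation above by its row totals (a minimal sketch using the objects already created):

# per-class prediction accuracy in percent
conf <- table(correct = correct_classes, predicted = lda.pred$class)
round(100 * diag(conf) / rowSums(conf), 1)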

K-means clustering

“Clustering is one of the most common exploratory data analysis technique used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar while data points in different clusters are very different. In other words, we try to find homogeneous subgroups within the data such that data points in each cluster are as similar as possible according to a similarity measure such as euclidean-based distance or correlation-based distance. The decision of which similarity measure to use is application-specific.” (https://towardsdatascience.com/k-means-clustering-algorithm-applications-evaluation-methods-and-drawbacks-aa03e644b48a)

# load the data
data("Boston")

# Standardizing Boston dataset
scaled_boston <- scale(Boston)

# euclidean distance matrix
dist_eu <- dist(scaled_boston)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(scaled_boston, method = 'manhattan')

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
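The quote above also mentions correlation-based distance; as a hedged sketch, one could be computed as one minus the correlation between the observations' variable profiles:

# correlation-based distance between observations (rows of the data)
dist_cor <- as.dist(1 - cor(t(scaled_boston)))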
# k-means clustering
km <- kmeans(scaled_boston, centers = 3)

# plot the scaled_boston dataset with clusters
pairs(scaled_boston, col = km$cluster)

set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(scaled_boston, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
km <- kmeans(scaled_boston, centers = 3)

# plot the scaled_boston dataset with clusters
pairs(scaled_boston, col = km$cluster)

I tested several different numbers of clusters. Based on the visualization, the results suggest that 3 is the optimal number of clusters, as it appears to be at the bend of the elbow (the point where the total WCSS stops dropping radically).
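To read the elbow numerically rather than only visually, one can look at the relative drop in total WCSS between consecutive values of k (a minimal sketch using the twcss vector computed above):

# relative drop in total WCSS when moving from k to k + 1;
# the elbow is where these drops level off
round(-diff(twcss) / head(twcss, -1), 2)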

Bonus

# load the data
data("Boston")

# Standardizing Boston dataset
scaled_kmeans_boston <- scale(Boston)

scaled_kmeans_boston <- as.data.frame(scaled_kmeans_boston)

# k-means clustering
km <- kmeans(scaled_kmeans_boston, centers = 3)

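# fit LDA using the k-means cluster assignment as the target variable,
# to see which variables separate the clusters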
lda_kmeans <- lda(km$cluster ~ ., data = scaled_kmeans_boston)
lda_kmeans
## Call:
## lda(km$cluster ~ ., data = scaled_kmeans_boston)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.2470356 0.3260870 0.4268775 
## 
## Group means:
##         crim         zn      indus         chas        nox         rm
## 1 -0.3989700  1.2614609 -0.9791535 -0.020354653 -0.8573235  1.0090468
## 2  0.7982270 -0.4872402  1.1186734  0.014005495  1.1351215 -0.4596725
## 3 -0.3788713 -0.3578148 -0.2879024  0.001080671 -0.3709704 -0.2328004
##           age        dis        rad        tax     ptratio      black
## 1 -0.96130713  0.9497716 -0.5867985 -0.6709807 -0.80239137  0.3552363
## 2  0.79930921 -0.8549214  1.2113527  1.2873657  0.59162230 -0.6363367
## 3 -0.05427143  0.1034286 -0.5857564 -0.5951053  0.01241316  0.2805140
##        lstat        medv
## 1 -0.9571271  1.06668290
## 2  0.8622388 -0.67953738
## 3 -0.1047617 -0.09820229
## 
## Coefficients of linear discriminants:
##                 LD1         LD2
## crim    -0.03206338 -0.19094456
## zn       0.02935900 -1.07677218
## indus    0.63347352 -0.09917524
## chas     0.02460719  0.10009606
## nox      1.11749317 -0.75995105
## rm      -0.18841682 -0.57360135
## age     -0.12983139  0.47226685
## dis      0.04493809 -0.34585958
## rad      0.67004295 -0.08584353
## tax      1.03992455 -0.58075025
## ptratio  0.25864960 -0.02605279
## black   -0.01657236  0.01975686
## lstat    0.17365575 -0.41704235
## medv    -0.06819126 -0.79098605
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8506 0.1494
# reuse the lda.arrows function defined earlier

# k-means cluster assignments as numeric classes (train$crime has only 404
# values, while lda_kmeans was fitted on all 506 observations)
km_classes <- km$cluster

# plot the lda results
plot(lda_kmeans, dimen = 2, col = km_classes, pch = km_classes)
lda.arrows(lda_kmeans, myscale = 4)

In the plot we see two overlapping clusters and one cluster that lies away from the others. The arrows tell us that nox, zn, tax, and medv are the most influential variables in the model.

Super Bonus

model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
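As a side note (a hedged sketch): predict() computes the same projection internally but centers the predictors first, so its scores differ from matrix_product only by a constant shift along each discriminant.

# LDA scores from predict(); these equal matrix_product up to centering
head(predict(lda.fit)$x)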

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# 3D plot of the train observations in the discriminant space, colored by crime class
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = train$crime)

# the same plot with the classes coded as numbers
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = classes)